What is computational linguistics?

What does "computational linguistics" mean
to *you*?
Definitions

Definition according to the University of Zurich (UZH) 1

Computational linguistics is a young discipline at the
intersection of linguistics and computer science. It
investigates

■ how human language is used as a means of transmitting,
storing and processing information, and

■ how these processes can be modelled on the computer and put
to use in concrete applications.

This is done primarily out of theoretical interest. [...]

1 See website: http://www.cl.uzh.ch/what-is-cl.html
Definitions

Definition on German Wikipedia 2

Computational linguistics (CL), or linguistic data
processing (LDV), investigates how natural language in the
form of text or speech data can be processed algorithmically
with the help of the computer. It is the interface between
linguistics and computer science.

2 See website: https://de.wikipedia.org/w/index.php?title=Computerlinguistik&oldid=139308216
Definitions

Definition on English Wikipedia 3

Computational linguistics is an interdisciplinary field
concerned with the statistical or rule-based modeling of
natural language from a computational perspective.

[...]

Computational linguistics has theoretical and applied
components, where theoretical computational linguistics
takes up issues in theoretical linguistics and cognitive
science, and applied computational linguistics focuses on
the practical outcome of modeling human language use.

3 See website: https://en.wikipedia.org/w/index.php?title=Computational_linguistics&oldid=663599748
Examples: tasks in computational linguistics

■ Tokenization

■ Part-of-speech tagging

■ Named Entity Recognition (NER)

■ Parsing (recognizing the syntactic structure of a text)

■ Coreference resolution

■ Collocations
Examples: areas of computational linguistics

■ Sentiment Analysis

■ Machine Translation (rule-based, statistical or hybrid)

■ Text Mining / Relationship Mining

■ Automatic Text Summarization

■ Authorship Recognition

■ Topic Analysis



Hernani Marques

Computerlinguistik und Massenüberwachung (Computational Linguistics and Mass Surveillance)

Chaos Singularity 2015 @ Biel/Bienne




Examples: areas of computational linguistics

Example pipeline: Apertium (machine translation)



[Source: Apertium documentation, xixona.dlsi.ua.es (URL truncated on the slide)]

Figure 1.1: The eight modules that build the assembly line of the shallow-transfer
machine translation system.
The sensors of the NDB

[Diagram; recoverable labels: cantonal intelligence offices, IMINT]
Mass surveillance in Switzerland

Onyx radio reconnaissance

Mass surveillance in Switzerland

Onyx analysis

Mass surveillance in Switzerland

6-month data retention (VDS) via EJPD



Art. 12 Obligations of providers

1 Providers of postal services are obliged to hand over to the ordering
authority the postal items and the further traffic and billing data
to the extent described in the surveillance order. On request, they
provide the ordering authority with further information about a
person's postal traffic.

2 [...] obliged to retain the data that permit subscriber identi[...]
as well as the traffic and billing data for [...] months [...]

3 The fact of the surveillance and all information relating to it
are subject to postal and telecommunications secrecy vis-à-vis third
parties (Art. 321ter StGB).
Mass surveillance in Switzerland

5-year data retention (VDS) via VBS

Art. 4 Data processing

1 The ZEO destroys the results obtained through radio reconnaissance
at the latest when the respective radio reconnaissance mandate ends.

2 It destroys the captured communications at the latest 18 months after
their capture.

3 It destroys the captured connection data at the latest 5 years
after their capture.

4 It may also use data captured on the basis of one radio reconnaissance
mandate to fulfil another radio reconnaissance mandate of the
same client.

5 The registration of data collections, the right to information and
inspection, and archiving are governed by the legal provisions
applicable to the respective client.
Example: mass surveillance Onyx (1)

Example: mass surveillance Onyx (2)

Source: www.parlament.ch/d/dokumentation/berichte/berichte-delegationen/berichte-der-geschaeftspruefungsdelegation/[...]

[...] of 3 February 1995 on the Armed Forces and the Military Administration (MG; SR 510.10),
which governs the tasks of the foreign intelligence service abroad.

Onyx entered service in April 2000 and is currently in trial operation.
Operational service will begin in the course of 2004; full operation
is planned for the end of 2005 or the beginning of 2006.

The Onyx system already offers its main user, the Strategic
Intelligence Service (SND) of the Department of Defence, Civil Protection
and Sport (VBS), numerous functions and possibilities for gathering
information. To a lesser extent it also serves the Service for
Analysis and Prevention (DAP) of the Federal Department of Justice and Police
(EJPD).

Onyx enables mass surveillance of communications. It facilitates
the gathering of useful information, for example in combating
the proliferation of weapons of mass destruction (WMD) or
international terrorism, and multiplies the capacities of the
intelligence services in these areas.
Example: mass surveillance Onyx (3)

[...] decipherable content of the communication in order to filter it automatically.
The filtering is carried out with the help of artificial intelligence systems. These
systems compare the content of the communication with the predefined addressing
elements and keywords (see Fig. 2). Messages that match none of these
criteria are automatically filtered out.
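
The filtering step described in the report can be illustrated with a few lines of Python. This is only a sketch under stated assumptions, not a description of how Onyx is built: the report speaks of artificial intelligence systems, whereas the code below uses plain set lookup and substring matching. The names `Message` and `passes_filter`, the sample numbers and the sample keyword are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str      # addressing element of the sender, e.g. a fax number
    recipient: str   # addressing element of the recipient
    text: str        # decipherable content of the communication


def passes_filter(msg, addresses, keywords):
    """Keep a message if it matches a predefined addressing element or keyword."""
    if msg.sender in addresses or msg.recipient in addresses:
        return True
    lowered = msg.text.lower()
    return any(kw.lower() in lowered for kw in keywords)


# Hypothetical criteria and traffic, for illustration only.
ADDRESSES = {"+00 111 222"}
KEYWORDS = {"centrifuge"}

traffic = [
    Message("+00 111 222", "+00 333 444", "see you on Monday"),
    Message("+00 555 666", "+00 777 888", "offer for one centrifuge"),
    Message("+00 555 666", "+00 777 888", "happy birthday"),
]

# Messages matching none of the criteria are dropped.
kept = [m for m in traffic if passes_filter(m, ADDRESSES, KEYWORDS)]
```

Note that the first message is kept purely because of who sent it, regardless of its content; this is what makes addressing elements a much broader criterion than keywords.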



Figure 2

Example of the capture of a fax communication between two
communication participants abroad
Example: mass surveillance Onyx (4)

When the GPDel demands that the interception of communications be explicitly
laid down in the MG, it is pursuing a goal of transparency. This demand is
justified less at the domestic level, since intercepting communication
participants in Switzerland is prohibited, than with regard to
international law and in particular the ECHR 41. Article 8 ECHR permits interference with
private life only where national security is to be safeguarded
and where certain conditions are met, such as the existence and accessibility of the
legal basis, proportionality, and so on. The European
Court of Human Rights has pointed out in several decisions
that laws regulating administrative or judicial interception
must be accessible to the public and drafted with sufficient precision and detail
so that citizens can respond to them with appropriate
behaviour 42
Source: zeit.de, digital/datenschutz/2015-04/bundesnachrichtendienst-bnd-[...]

What are selectors?

Selectors are something like search terms. They can be IP addresses,
telephone numbers or e-mail addresses, as well as geo-coordinates,
MAC addresses or URLs. But individual search terms can also be selectors,
that is, names or abbreviations of companies and authorities, or expressions such as
"Eurocopter".

Three things are thus fed into the BND's databases: the
data intercepted from the lines, the selectors supplied by the NSA, and
the selectors the BND has created itself, because it too naturally
rummages through the data looking for anything interesting. As a result,
the computers return all information that has anything to do with such a
search term: whom the holder of a telephone number called,
who was at a particular place, and so on. These are the
so-called positive selectors, which are used to search actively for something.

In BND parlance there are also negative selectors, which work like
upstream filters. If negative selectors turn up in a
search result, further analysis is to be aborted at that point.
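
The interplay of positive and negative selectors can be sketched in Python. This is a minimal illustration of the logic as the article describes it, not the BND's or the NSA's actual system: the functions `matches` and `select`, the record layout, and every sample value are assumptions made for this example, and real selectors would be matched per field type rather than by substring.

```python
def matches(record, selectors):
    """Return True if any selector occurs in any field value of the record."""
    values = [str(v).lower() for v in record.values()]
    return any(sel.lower() in value for sel in selectors for value in values)


def select(records, positive, negative):
    """Keep records hit by a positive selector unless a negative selector also occurs."""
    hits = []
    for record in records:
        if not matches(record, positive):
            continue  # nothing was actively searched for in this record
        if matches(record, negative):
            continue  # negative selector present: analysis stops here
        hits.append(record)
    return hits


# Hypothetical selectors and intercepted records, for illustration only.
POSITIVE = {"eurocopter", "192.0.2.7"}
NEGATIVE = {"@example.de"}

records = [
    {"from": "a@example.org", "to": "b@example.com", "body": "Eurocopter offer"},
    {"from": "c@example.de", "to": "d@example.com", "body": "Eurocopter meeting"},
    {"from": "e@example.org", "ip": "192.0.2.7", "body": "hello"},
    {"from": "f@example.org", "to": "g@example.com", "body": "lunch?"},
]

hits = select(records, POSITIVE, NEGATIVE)
```

In this sketch the second record is hit by a positive selector but is dropped because a negative selector occurs in it as well. The protection therefore depends entirely on the negative list being complete.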



Example: mass surveillance BND+NSA

Surveillance with selectors (2)
Source: wirtschaftsblatt.at

According to the report, members of parliament confronted the witness,
who is said to have been responsible at the BND for reviewing and deleting
critical selectors, with a list of names from the archive of
former NSA employee Edward Snowden. The 31 entries are said to
include companies such as Mercedes, Deutsche Bank, the
securities services provider Clearstream and the
telecommunications company Debitel. According to the report, however,
the employee did not comment on whether and for how long the
selectors were active and whether the NSA spied on German
targets with the help of the BND.
INDECT: nature

What is INDECT? Showtime!

Source: archive.org, "Project Indect (EU Surveillance Programme)"

Topics: EU, Project Indect, leak, leaked, surveillance, state, totalitarism, civil rights

INDECT-400px.ogv (Ogg multiplexed audio/video file, Theora/Vorbis, length 5m41s, 400×222
pixels, 962kbps overall)

Summary

Leaked presentation and propaganda video for the EU's surveillance project INDECT.
Project INDECT is part of the EU's Seventh Framework Programme.
For intended usage see User:Brian McNeil/Project INDECT

Run time: 5 minutes 41 seconds
INDECT: nature

More INDECT videos

INDECT: selected papers, Work Package 4



Deliverable 4.1: XML Data Corpus (1) 



XML Data Corpus: Report on methodology for collection, cleaning 
and unified representation of large textual data from various sources: 
news reports, weblogs, chat. 
INDECT: selected papers, Work Package 4



Deliverable 4.1: XML Data Corpus (2) 



■ Government (Tag: ORG.GOV)

This subtype refers to entities that are related to governmental affairs, politics, or the
state. Note that the entire government of a GPE is excluded from this subtype and
should be classified as GPE.ORG as we will see later. This subtype also includes
military organizations that are connected to the government of a GPE. Some
examples are the following:

[The British navy] announced yesterday that . . .

[The ministry of culture] has funded our research . . .

■ Commercial (Tag: ORG.COM)

This subtype refers to organizations, which primarily focus on providing products or
services for profit. Some examples are the following:

[Google's search engine] is based on Page Rank . . .

[Apple] announced yesterday the release of its new iPhone . . .
Deliverable 4.1: XML Data Corpus (3)

D4.1 XML Data Corpus

4.4.2 News report

Among the most serious incidents reported to the (National Criminal Intelligence Service
[ORG.GOV]) NCIS [ORG.GOV] were:

July 2008: Glasgow Rangers [ORG.SPO] v Shelbourne [ORG.SPO]. Police baton
[ORG.GOV] charged 150 Rangers supporters [PER.Group] who were trying to attack fans
of the Irish club [PER.Group].

August 2008: Norwich City [ORG.SPO] v QPR [ORG.SPO]: Twenty supporters from both
sides [ORG.SPO] involved in bottle throwing in a Norfolk [LOC.ADD] pub. One person
arrested.

30 September 2008: Norwich City [ORG.SPO] v Birmingham City [ORG.SPO]: Twenty
Birmingham fans [PER.Group] sprayed rival supporters with CS gas [WEA] and attacked
them with bar stools in a pub.

• Identified Events

o E1: [Police baton {charged} 150 Rangers supporters who were trying to
attack fans of the Irish club.]

o E2: [150 Rangers supporters who were {trying to attack} fans of the Irish
club.]

o E3: [150 Rangers supporters who were {trying to attack} fans of the Irish
club.]
Deliverable 4.1: XML Data Corpus (4)

D4.1 XML Data Corpus

4.4.3 Terrorist Chat 7

Shazad Tanweer [PER.Individual]: Any extra risks getting into Pakistan [GPE.NAT]?

Omar Khyam [PER.Individual]: We had five Bengalis [GPE.NAT] last year. Guess how we
[PER.Group] got them [GPE.NAT] in. From Bangladesh [GPE.NAT] all the way across
India [GPE.NAT] into Pakistan [GPE.NAT]... we [PER.Group] bribed the guy
[PER.Individual]. You know when you [PER.Individual] go to the check-in, it would all be set
up.

Mohammed Siddique Khan [PER.Individual]: Going through the airport - normal tickets.

Omar Khyam [PER.Individual]: Yeah, just walk straight through bruv normal, just act as if
you are a Pakistani [GPE.NAT].

Shazad Tanweer [PER.Individual]: I live in Faisalbad [GPE.NAT]
Omar Khyam [PER.Individual]: That's not a problem

Omar Khyam [PER.Individual]: All right bruv [PER.Individual]. Get your parents to pick
you up. Or your family ... And that way you will breeze through the airport seriously. Even if
they [ORG.GOV] are following you [PER.Individual] - it doesn't really count. Chill out,
proper chill out ... until we [PER.Group] contact you and then we'll pick you [PER.Individual]
up.
Deliverable 4.1: XML Data Corpus (5)

Identified Events

o E1: [Guess how we {got} them {in}. From Bangladesh all the way across India into
Pakistan]

o E2: [Guess how we {got} them {in}. From Bangladesh all the way across India into
Pakistan]

o E3: We {bribed} the guy.
o E4: We {bribed} the guy.

o E5: [when you {go to the check-in}, it would all be set up.]
o E6: [Even if they {are following} you]
o E7: [Even if they {are following} you]

Event ID   Event Type
E1         Transportation.Illegal
E2         Transportation.Illegal
E3         FinancialTransaction.Illegal
INDECT: selected papers, Work Package 4

Deliverable 4.3: Behavioural Profiling (1)

Report on current state-of-the-art on machine learning methods for
behavioural profiling

Deliverable 4.3: Behavioural Profiling (2)

[Figure; only the label "Update dynamic" is recoverable]

Deliverable 4.3: Behavioural Profiling (3)

4. Offenders characteristics profiling methods

4.1. Introduction 

Modus operandi (Method of Operation) is a Latin phrase that is typically used in criminal
investigation to describe a crime committed by an offender as well as the methods employed
for committing such a crime. The description of a crime is typically written in a
police record that might contain a number of structured and unstructured data including
the following:

• Free text describing the method employed for committing a crime

• Feature code for the type of a particular offence

• Feature code for the presence or absence of a specific aspect of behaviour

• Feature code for the gender of the offender

• Feature code for the age of the offender

• Feature code for the ethnic appearance of the offender
Deliverable 4.3: Behavioural Profiling (4)

4.2. Language modelling

Bache et al. [28, 2] presented a language modelling method for inferring the characteristics
of offenders from an existing police archive. Based on the information included in
police reports of solved cases their target was to link behavioural features with
characteristics of offenders. They have focused on the following offender characteristics:

1. Gender

The gender of a criminal agent can either be male or female.

2. Age

The age of a criminal agent is defined to be either below or above the median of the
ages of offenders committing a series of crimes.

3. Ethnic appearance

The ethnic appearance of an offender can either be white European or Afro-Caribbean.

4. Occupation

The occupation of an offender can only take two possible values, i.e. employed or
unemployed.
Deliverable 4.3: Behavioural Profiling (5)

[Figure 7: Functional word taxonomies; recoverable node label: Personal Pronoun]

Recently, another similar approach to authorship profiling was presented in Argamon et
al. [37]. Their main difference with the previous method is that their focus is not on
identifying the author of a particular document. In contrast, they exploit free text in
order to identify an author's gender, age, native language and neuroticism level.
Deliverable 4.11: Mine/Detect suspicious Websites (1)

Specification of methods for mining and detecting suspicious
Websites

Deliverable 4.11: Mine/Detect suspicious Websites (2)

Deliverable 4.11: Mine/Detect suspicious Websites (3)

N-grams

Since terms as keywords are not always accurate, phrases as keywords for document retrieval provides
an alternative. A phrase is a sequence of terms in a document. An n-gram is a continuous sequence
of n terms in a text. Below we show 2-grams and 3-grams representation of the sentence, "Hand the
package to the Big Boss".

2-gram: "hand the" "the package" "package to" "to the" "the Big" "Big Boss"

3-gram: "hand the package" "the package to" "package to the" "to the Big" "the Big Boss"

We can see that as we increase n we capture more context relating to words within the phrase.
N-grams as features for text mining has been used to capture "semantic" information of keywords [4, 5, 6].
N-grams also have some distinctive disadvantages. Firstly, to better capture the context of any word
n has to be high. Once the phrase is long, it becomes very sparse within the document collection.
For example, it will be difficult to find exact match of a 6-gram phrase "Hand the package to the Big
Boss". Also phrases contains large number of redundant terms. The terms like "to" will hurt the text
mining process as it will be common in most type of documents.
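
The n-gram construction quoted above is easy to reproduce. The following Python sketch splits on whitespace only and keeps the original capitalisation; the function name `ngrams` is chosen for this example, and the deliverable does not say how its authors tokenized the text.

```python
def ngrams(sentence, n):
    """Return all contiguous sequences of n whitespace-separated terms."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Hand the package to the Big Boss"

bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

# The longer the n-gram, the fewer of them a sentence yields, and the less
# likely each one is to reappear elsewhere in a document collection.
counts = {n: len(ngrams(sentence, n)) for n in range(1, 8)}
```

For the seven-word example sentence, `counts` falls from seven 1-grams to a single 7-gram, which illustrates the sparsity problem the deliverable mentions.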
Deliverable 4.11: Mine/Detect suspicious Websites (4)

[Figure 4: Dependency parse phrase for "Hand the package to the Big Boss"; recoverable arc labels: dobj, pobj, det]

A dependency parse phrase is constructed following a dependency path of given terms generated by a
dependency parser. Figure 4 represents the dependency parse of the sentence Hand the package to
the Big Boss. Dependency phrases have been successfully used in text classification [7] and relation
mining [8]. Both text classification and relation mining share overlapped methodology with text
mining. In D1.5 we successfully used dependency parse phrases for relation mining problem.
Deliverable 4.11: Mine/Detect suspicious Websites (5)

6. Pattern Matching

In this step we select patterns which show high association to suspicious Websites than to normal
Websites. In many suspicious Websites, the sentences containing messages to influence criminal
activities are generally grouped within other normal sentences. For example, a suspicious Websites
can have many factual information and few suspicious lines. Thus, the patterns extracted from
such suspicious Websites are not all indicative of criminal activities. Most of these patterns will
also occur in normal Websites. To filter out such normal patterns we use a very simple approach.
Once we generate patterns from both suspicious Websites and normal Websites. The patterns
indicative of criminal activities are only those which are not present in normal Websites. Thus, we
select only patterns which are present in suspicious Websites but not in normal Websites. For exam[...]

Patterns from suspicious Websites   Patterns from normal Websites
hand-package-boss                   everest-mountain
everest-mountain                    tall-mountain-world
tall-mountain-world                 temperature-cold-winter

Table 4: Possible patterns generated from suspicious and normal Websites
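
The selection rule described in the deliverable amounts to a set difference. A minimal Python sketch using the patterns from Table 4 follows; the function name `indicative_patterns` is hypothetical, and how the patterns are generated in the first place is outside this sketch.

```python
def indicative_patterns(suspicious_patterns, normal_patterns):
    """Keep only patterns that occur in suspicious sites and never in normal ones."""
    return set(suspicious_patterns) - set(normal_patterns)


# Patterns as listed in Table 4 of the deliverable.
suspicious = {"hand-package-boss", "everest-mountain", "tall-mountain-world"}
normal = {"everest-mountain", "tall-mountain-world", "temperature-cold-winter"}

selected = indicative_patterns(suspicious, normal)
```

With the Table 4 data only `hand-package-boss` survives. The flip side of such a simple rule is that any harmless pattern which happens to be missing from the "normal" sample is automatically treated as indicative.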
Source: www.spiegel.de/media/media-34103.pdf

A bit more detail

TEMPORA are GCHQ's large-scale, Deep Dive deployments on Special Source access (SSE).
Deep Dive XKeyscores work by promoting loose categories of traffic (e.g., all web, email, social,
chat, EA, VPN, VoIP...) from the bearers feeding the system and block all the high-volume, low
value traffic (e.g., P2P downloads). This usually equates to ~30% of the traffic on the bearer. We
keep the full sessions for 3 working days and the metadata for 30 days for you to query, using all
the functionality that Keyscore offers to slice and dice the data. The aim is to put the best 7.5% of
our access into TEMPORA's, comprising a mix of Deep Dive Keyscores and promotion of data
based on IP subnet or technology type from across the entire MVR. At the moment, users are
able to access 46x10Gs of data via existing Internet Buffers. This is a lot of data! Not only that,
but the long-running TINT program and our initial 3-month operational trial of the CPC Internet
Buffer (the first operational Internet Buffer to be deployed) show that every area of ops can get
real benefit from this capability, especially for target discovery and target development. Internet
Buffers are different from TINT in that the latter is purely an experimental, research environment
whereas Internet Buffers can be used operationally for [illegible] etc.

For a more detailed depiction of how TEMPORA and TINT differs please see here.
XKeyscore (1)

TOP SECRET//COMINT//REL TO USA, AUS, CAN, GBR, NZL

What is XKEYSCORE?

1. DNI Exploitation System/Analytic Framework

2. Performs strong (e.g. email) and soft (content) selection

3. Provides real-time target activity (tipping)

"Rolling Buffer" of ~3 days of ALL unfiltered data seen by
XKEYSCORE:

• Stores full-take data at the collection site - indexed by meta-data

• Provides a series of viewers for common data types

Federated Query System - one query scans all sites

• Performing full-take allows analysts to find targets that were
previously unknown by mining the meta-data



TOP SECRET//COMINT//REL TO USA, AUS, CAN, GBR, NZL 
XKeyscore (2)




• How do I find a strong-selector for a known 
target? 

• How do I find a cell of terrorists that has no 
connection to known strong-selectors? 

• Answer: Look for anomalous events 

• E.g. Someone whose language is out of place for the 
region they are in 

• Someone who is using encryption 

• Someone searching the web for suspicious stuff 



TOP SECRET//COMINT//REL TO USA, AUS, CAN, GBR, NZL 
XKeyscore (3)

TOP SECRET//COMINT//REL TO USA, AUS, CAN, GBR, NZL

Google Maps 




• My target uses Google Maps to scope target 
locations - can I use this information to 
determine his email address? What about the 
web-searches - do any stand out and look 
suspicious? 



• XKEYSCORE extracts and databases these events 
including all web-based searches which can be 
retrospectively queried 

• No strong-selector 

• Data volume too high to forward 



TOP SECRET//COMINT//REL TO USA, AUS, CAN, GBR, NZL 
XKeyscore (4)

TOP SECRET//COMINT//REL TO USA, AUS, CAN, GBR, NZL

Extraction 




• Have technology (thanks to R6) - for 
English, Arabic and Chinese 

• Allow queries like: 

• Show me all the word documents with 
references to IAEO 

• Show me all documents that reference 
Osama Bin Laden 

• Will allow a 'show me more like this' 
capability 



TOP SECRET//COMINT//REL TO USA, AUS, CAN, GBR, NZL 
XKeyscore (5)

Source: https://www.tagesschau.de/inland/nsa-xkeyscore-100.html (13.07.2014)

How quickly one becomes an "extremist"

Seventh period at the Catholic grammar school in Berlin-Neukölln, 1:30 pm: posters of
the computer scientists Tim Berners-Lee and Ada Lovelace hang on the wall. The teacher
has drawn an onion on the blackboard, with the acronym "Tails" next to it.

"Tails" is an operating system that uses the Tor network to leave no traces on the
internet, and that also stores nothing about the user on the computer from which
it is booted, for example from a USB stick.

Darko Medic, 18, short brown hair, sits in front of his laptop. He types "Tails" and
"USB" into his search engine. What Darko does not know: by doing so he has
just landed in an NSA database as well, marked as one of the
extremists the intelligence officers are searching for so diligently.

For the rules in the source code reveal something else too: the NSA monitors
search queries worldwide on a large scale, including in Germany. Merely
searching for anonymisation software such as "Tails" is enough to get caught in the
NSA's dragnet. The link between the query and search engines makes [...]
Katalyst-Whitepaper (1)

Kapow Katalyst for OSINT

Harvest text in any language, images, audio, video from websites, blogs, and social media.
Secure and non-attributable. Kapow Katalyst, the best-kept secret in OSINT.

[Logos: flickr, YouTube, LinkedIn, Al Jazeera, WordPress]

Harvest Any OSINT Data with Katalyst

OSINT data sources are as varied as the Internet itself.
Mission-critical data can reside in blogs, in news feeds, in
social media, and can even be hosted on short-lived sites
on the dark web. As technology standards c[...]

Extract Data in any Language

Katalyst offers built-in support for multi-byte character
encodings such as Chinese and bidirectional languages
such as Arabic and Hebrew. Katalyst is in daily use
throughout the IC to harvest the contents of news sites, [...]
Source: https://www.wikileaks.org/spyfiles/docs/KAPOWSOFTWARE-2011-BuilyourOSIN-en[...]

Build Your OSINT Capability with Kapow

Kapow's Extraction Browser is a perfect match for the needs of OSINT.

Kapow can fully automate data extraction from any web site. Without human intervention, Kapow
Katalyst can authenticate, perform complex navigation, carry out sophisticated extraction and
transformation rules, and reliably deliver data wherever it is needed.

Kapow Katalyst is fully standards-based, allowing smooth exchange of data in any required form. Data
can also be exchanged via web services, Java, or .NET functions.

Kapow's integrated environment allows you to build even the most sophisticated crawlers without the
need to develop any code, and test them in real-time, increasing agility and shrinking effort to a
minimum.
Who is Scan & Target?

Scan & Target analyzes digital communications in real
time to provide actionable intelligence to software
vendors, brands, service publishers, marketing agencies,
governments...

Social networks

Forums, blogs

Instant Messaging

Our Text Meaning technology is smart enough to look in real
time at an incoming text User Generated Content data
stream, see patterns of interest, and alert the right
people or trigger the appropriate action, all without
being queried
Scan & Target technology

Unlike solutions based on simple keywords or semantics, our technology
takes into account the different alterations and variants of
expressions to analyze the content:

■ Small/capital letters use

■ Letters repetition (vvviiiagrrra for example)

■ Orthographical variations (vi@gra, vlagra, vl@gra, vl49r4)

■ Missing letters in some cases (v | agra, v agra...)

■ Word alteration whatever the use of non alpha symbol (v.i.a.g.r.a,
v_i°ag#r:a, v-iagra, viagr"a...)

■ Phonetic alterations

■ SMS and IM languages

■ And the combination of these variations

The solution is available in English, French, Spanish and
Arabic (MSA + dialects, Arabic alphabet + transliteration).
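
Several of the variations listed above can be caught with a simple normalisation step followed by approximate matching. The Python sketch below is an assumption-laden illustration and not Scan & Target's proprietary method: the substitution table `LEET`, the functions `normalise` and `is_variant`, and the similarity threshold of 0.8 are all choices made for this example. It handles ASCII letters only and ignores phonetic, SMS and transliteration variants.

```python
import re
from difflib import SequenceMatcher

# Hypothetical table of common character substitutions.
LEET = str.maketrans({"@": "a", "4": "a", "1": "i", "!": "i", "|": "i",
                      "0": "o", "3": "e", "9": "g", "$": "s"})


def normalise(text):
    """Lowercase, undo character substitutions, drop non-letters, collapse repeats."""
    text = text.lower().translate(LEET)
    text = re.sub(r"[^a-z]", "", text)       # removes dots, dashes, spaces, symbols
    return re.sub(r"(.)\1+", r"\1", text)    # "vvviii" becomes "vi"


def is_variant(candidate, keyword, threshold=0.8):
    """Treat candidate as a variant if the normalised strings are similar enough."""
    a, b = normalise(candidate), normalise(keyword)
    return SequenceMatcher(None, a, b).ratio() >= threshold


variants = ["VIAGRA", "vvviiiagrrra", "vi@gra", "vlagra", "vl@gra", "vl49r4",
            "v.i.a.g.r.a", "v_i°ag#r:a", "v-iagra", "v agra"]
results = {v: is_variant(v, "viagra") for v in variants}
```

The threshold is the critical design choice in such a scheme. Lowering it catches more distorted spellings but also more unrelated words, which is exactly the false positive trade-off the vendor's own figures are about.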
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The solution is based on a smart engine that rates not just single words 
but the entire content as it passes through the filtering engine. Words 
are therefore placed in context to extract meaning. 

The solution applies detailed thematic thesauruses, our Smart 
Wordbooks. Filters are categorized (Terrorism, Drugs, Violence, etc.) so that 
customers can fine-tune the analysis according to their needs. 

Additional analysis layers: sentiment analysis, question detection... 

Proprietary scoring technology tailored to short digital text content. 

Using a powerful and accurate conditional analysis system, our 
customers experience a very low level of false positives (between 0.05% 
and 0.001% on average). 
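
The idea of scoring a whole message against thematic wordbooks, and alerting only when a condition over several themes holds, might look roughly like the sketch below. Everything in it is hypothetical: the `WORDBOOKS` contents, the theme names, and the rule in `should_alert` are placeholders, since the slides do not describe the actual wordbooks or scoring.

```python
import re

# Hypothetical, tiny stand-ins for thematic wordbooks.
WORDBOOKS = {
    "drugs": {"cocaine", "heroin"},
    "transaction": {"kilo", "price", "cash"},
}


def theme_hits(message: str) -> dict:
    """Count, per theme, how many words of the message appear in that theme's wordbook."""
    words = re.findall(r"[a-z]+", message.lower())
    return {theme: sum(w in vocab for w in words) for theme, vocab in WORDBOOKS.items()}


def should_alert(message: str, required=("drugs", "transaction")) -> bool:
    """Conditional rule: alert only when every required theme occurs in the same message."""
    hits = theme_hits(message)
    return all(hits[theme] > 0 for theme in required)
```

Requiring several themes to co-occur is one simple way to keep false positives down: a message that only asks about a price, or only mentions a substance, raises no alert.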
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• For homeland security, our API is distributed on 
IBM hardware (to be hosted on your premises) 

• Thanks to our connector, it is very easy to 
integrate our API into your own applications 

• You choose how to display our analysis results in 
your interfaces 

• Capacity to deal with Big Data in real time 

- All of Twitter's traffic (10 TB/day, 1,200 tweets per 
second on average)* could be analyzed in real time using one IBM blade center 
(for one language) 

- *Source: Twitter 
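
To put the quoted Twitter figures into perspective, the short calculation below converts them into daily and per-second quantities. It assumes decimal terabytes (1 TB = 10^12 bytes); the variable names are chosen for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_second = 1200          # average quoted on the slide
bytes_per_day = 10 * 10**12       # 10 TB/day, assuming decimal terabytes

tweets_per_day = tweets_per_second * SECONDS_PER_DAY    # 103,680,000 tweets
bytes_per_second = bytes_per_day / SECONDS_PER_DAY      # about 115.7 MB/s
bytes_per_tweet = bytes_per_day / tweets_per_day        # about 96 KB per tweet
```

Taken together, the two figures imply roughly 96 KB per tweet, far more than the message text itself. The 10 TB figure therefore presumably covers more than the text; the slide does not say what it includes.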
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Mass interception issues 
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Mass interception of digital text 
communications (OSINT or COMINT such as SMS, 
e-mails, IM...) is now technically feasible 



• Issues for intelligence or law enforcement 
agencies: 

- How to deal with the volume (flow never stops) 

- How to find the needle in the digital haystack 
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"Finding the needle" strategies 
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[Table: strategies compared by benefit; the "+" marks in the cells are not recoverable] 

Strategies (columns): Identified suspects; Interception on keywords; Indexation and search; Text Meaning 

Benefits (rows): Real-time information; Fuzzy search; Advanced analysis; False positive ratio; Unknown threat detection; Required analyst time 
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Arabic chat alphabet 


• The Arabic chat alphabet (Arabish or Arabizi) is 
used to communicate in Arabic over 
the Internet or to send messages via mobile 
phones when the Arabic alphabet is unavailable 

• Arabic letters are replaced by letters that are 
phonetically equivalent 



• Arabic letters that have no Latin phonetic 
counterpart are represented by numbers, or 
numbers in conjunction with an accent mark 
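
A minimal sketch of the digit convention described above follows. The table reflects commonly cited Arabizi usage, which varies by region and writer, so it should be read as an assumption rather than a fixed standard. The function name `arabizi_digits` is hypothetical. Arabic letters are written as Unicode escapes, and the apostrophe is accepted on either side of the digit.

```python
# Commonly cited Arabizi digit conventions (assumed; usage varies by region and writer).
DIGITS = {
    "2": "\u0621",  # hamza
    "3": "\u0639",  # ayn
    "5": "\u062e",  # kha
    "6": "\u0637",  # emphatic ta
    "7": "\u062d",  # ha
    "9": "\u0635",  # sad
}
ACCENTED_DIGITS = {
    "3": "\u063a",  # ghayn
    "6": "\u0638",  # emphatic za
    "7": "\u062e",  # kha
    "9": "\u0636",  # dad
}

# Accept the accent mark before or after the digit: 6' and '6.
PAIRS = {}
for digit, letter in ACCENTED_DIGITS.items():
    PAIRS[digit + "'"] = letter
    PAIRS["'" + digit] = letter


def arabizi_digits(text):
    """Return (token, Arabic letter) for every digit token found in an Arabizi string."""
    found = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in PAIRS:
            found.append((pair, PAIRS[pair]))
            i += 2
        elif text[i] in DIGITS:
            found.append((text[i], DIGITS[text[i]]))
            i += 1
        else:
            i += 1
    return found
```

This only tags the digit tokens. Turning a full Arabizi string into Arabic script also requires handling vowels and ambiguous Latin letters, which this sketch does not attempt.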
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Issues with Arabic compared to Latin-script 
languages 
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• Language identification issue: 

- MSA, dialects, mix of languages 

• Transliteration issue (notably for names): 

- ABD AL-WADOUB 

- ABD EL OUADOUD 

- ABD-AL-WADUD 

- ABDEL EL-WADOUD 

• Use of Arabish / Arabizi: 

- bri6ania al3o'6ma / britanya al 3ozma = Great Britain, 
for example 
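
One way to treat the transliteration variants listed above as the same name is to reduce each spelling to a key and compare keys approximately. The rules below (dropping the article, folding "OU" into "U", the 0.85 threshold) are illustrative assumptions tuned to these four examples, not a general transliteration scheme; `name_key` and `same_name` are hypothetical names.

```python
import re
from difflib import SequenceMatcher

ARTICLES = {"AL", "EL"}
FUSED_ABD = {"ABDEL", "ABDAL", "ABDUL"}   # article fused onto the first element


def name_key(name: str) -> str:
    """Reduce a transliterated name to a spelling-independent key."""
    parts = []
    for token in re.findall(r"[A-Z]+", name.upper()):
        if token in ARTICLES:
            continue                       # drop the free-standing definite article
        if token in FUSED_ABD:
            token = "ABD"
        # French-style "OUA"/"OU" versus English-style "WA"/"U"
        token = token.replace("OUA", "WA").replace("OU", "U")
        parts.append(token)
    return "".join(parts)


def same_name(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two spellings as the same name when their keys are nearly identical."""
    return SequenceMatcher(None, name_key(a), name_key(b)).ratio() >= threshold
```

Three of the four spellings reduce to the identical key `ABDWADUD`. The first one ends in "B" and yields `ABDWADUB`, which the approximate comparison still accepts at the chosen threshold.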
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New threat detection 
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Contextual alerts 

Target identification 
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Messages vs thread 


• A web or mobile conversation is a thread of 
messages between two or more people 

• Analysis is first performed at message level to raise 
contextual alerts 

• When an alert is raised, the associated 
discussion thread is analyzed again to: 

- Increase accuracy and precision 

- Extract investigation elements (names, places, 
nationalities...) 
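
The two-pass flow described above (a cheap check per message, then a second look at the whole thread) can be sketched as follows. The trigger list, the gazetteer, and all names (`Message`, `message_alert`, `analyze`) are hypothetical placeholders. The second pass here only collects participants and known place names and does not model the accuracy improvement mentioned on the slide.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    thread_id: str
    sender: str
    text: str


# Hypothetical stand-ins: a trigger list for the first pass, a gazetteer for the second.
TRIGGER_TERMS = {"shipment"}
KNOWN_PLACES = {"rotterdam", "marseille"}


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def message_alert(message):
    """First pass: cheap check on a single message."""
    return any(w in TRIGGER_TERMS for w in words(message.text))


def analyze(messages):
    """Second pass: re-read the whole thread of every message that raised an alert."""
    threads = {}
    for m in messages:
        threads.setdefault(m.thread_id, []).append(m)

    reports = []
    for thread_id, thread in threads.items():
        alerts = sum(message_alert(m) for m in thread)
        if alerts == 0:
            continue  # threads without any alert are never examined in depth
        reports.append({
            "thread_id": thread_id,
            "alerts": alerts,
            "participants": sorted({m.sender for m in thread}),
            "places": sorted({w for m in thread for w in words(m.text) if w in KNOWN_PLACES}),
        })
    return reports
```

The design point is that the expensive thread-level step runs only for the small share of threads in which some message has already raised an alert.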
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= automatic contextual 
alert sent for potential 
child pornography 
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Use case: drug trafficking detection 
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Mass surveillance of SMS communications (20 to 30 million per 
day in many different languages: English, Arabic, dialects...) 

Contextual alerts sent to analysts using conditional analysis 
on: 

- Substance-related discussions 

- Transaction-related discussions (quantities, money...) 

- Middleman-related discussions (dealers, luggage handlers, dockers, customs...) 

- Smuggling-related discussions (places such as ports and airports, smuggling tricks) 

Investigation by analysts (conversation thread analysis, social 
network analysis...) identifies: 

- Dealers' rings (pseudonyms, IP addresses...) 

- Coded language (use of culinary vocabulary, for example) 

High precision: 40 alerts per million SMS 
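
As a rough check of what the quoted figures mean for analyst workload, the calculation below combines the alert rate with the daily SMS volume. Only the numbers come from the slide; the function name is chosen for illustration. The figure is an alert rate: the slide does not state how many of those alerts turn out to be relevant.

```python
ALERTS_PER_MILLION = 40  # figure quoted on the slide


def alerts_per_day(sms_per_day: int) -> int:
    """Expected number of alerts for a given daily SMS volume."""
    return sms_per_day * ALERTS_PER_MILLION // 1_000_000


low = alerts_per_day(20_000_000)     # 800 alerts per day
high = alerts_per_day(30_000_000)    # 1200 alerts per day
alert_rate_percent = ALERTS_PER_MILLION / 1_000_000 * 100   # 0.004 % of messages
```

At 20 to 30 million SMS per day, analysts would therefore receive on the order of 800 to 1,200 alerts daily.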
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Recommended Solution 
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Scan & Target text meaning technology is a very efficient 
tool for detecting previously unknown terrorist or criminal 
threats on the Internet or wireless networks 



Main benefits: 

- Ability to deal with huge volumes in real time 

- Multilingual, with the ability to handle fuzzy languages such as IM 
or Arabizi 

- Actionable intelligence with message and thread analysis 

- Low level of false positives thanks to advanced analysis 

To be integrated into your existing monitoring system 
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